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Perturbation theory for Markov chains addresses the question of how small differences in the 
transition probabilities of Markov chains are reflected in differences between their distribu¬ 
tions. We prove powerful and flexible bounds on the distance of the nth step distributions of 
two Markov chains when one of them satisfies a Wasserstein ergodicity condition. Our work is 
motivated by the recent interest in approximate Markov chain Monte Carlo (MCMC) meth¬ 
ods in the analysis of big data sets. By using an approach based on Lyapunov functions, we 
provide estimates for geometrically ergodic Markov chains under weak assumptions. In an au¬ 
toregressive model, our bounds cannot be improved in general. We illustrate our theory by 
showing quantitative estimates for approximate versions of two prominent MCMC algorithms, 
the Metropolis-Hastings and stochastic Langevin algorithms. 
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1. Introduction 

Markov chain Monte Carlo (MCMC) algorithms are one of the key tools in computational 
statistics. They are used for the approximation of expectations with respect to probability 
measures given by unnormalized densities. For almost all classical MCMC methods it is 
essential to evaluate the target density. In many cases, this requirement is not an issue, 
but there are also important applications where it is a problem. This includes applications 
where the density is not available in closed form, see [27], or where an exact evaluation is 
computationally too demanding, see [2]. Problems of this kind lead to the approximation 
of Markov chains and to the question of how small differences in the transitions of two 
Markov chains affect the differences between their distributions. 

In Bayesian inference when big data sets are involved an exact evaluation of the target 
density is typically very expensive. For instance, in each step of a Metropolis-Hastings 
algorithm the likelihood of a proposed state must be computed. Every observation in 
the underlying data set contributes to the likelihood and must be taken into account 
in the calculation. This may result in evaluating several terabytes of data in each step 
of the algorithm. These are the reasons for the recent interest in numerically cheaper 
approximations of classical MCMC methods, see [3, 4, 23, 42, 47]. A reduction of the 
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computational costs can, e.g., be achieved by relying on a moderately sized random sub¬ 
sample of the data in each step of the algorithm. The function value of the target density 
is thus replaced by an approximation. Naturally, subsampling and alternative attempts 
at “cutting the Metropolis-Hastings budget” [23] induce additional biases. These biases 
can lead to dramatic changes in the properties of the algorithms as discussed in [6]. 

We thus need a better theoretical understanding of the behavior of such approximate 
MCMC methods. Indeed, a number of recent papers prove estimates of these biases, see 
[2, 3, 19, 24, 29, 35]. A key tool in these papers are perturbation bounds for Markov chains. 
One such result for uniformly ergodic Markov chains due to Mitrophanov [33] is used 
in [2]. A similar perturbation estimate implicitly appears in [3]. The focus on uniformly 
ergodic Markov chains is rather restrictive, especially for high-dimensional, non-compact 
state spaces such as R™. Working with Wasserstein distances has recently turned out to 
be a fruitful alternative in several contributions on high-dimensional MCMC algorithms, 
see [11, 12, 14, 18, 25]. 

We provide perturbation bounds based on Wasserstein distances, which lead to flexi¬ 
ble quantitative estimates of the biases of approximate MCMC methods. Our first main 
result is the Wasserstein perturbation bound of Theorem 3.1. Under a Wasserstein er- 
godicity assumption, explained in Section 2, it provides an upper bound on the distance 
of the nth step distribution between an ideal and an approximating Markov chain in 
terms of the difference between their one-step transition probabilities. The result is well- 
suited for applications on a non-compact state space, since the difference of the one-step 
transition probabilities is measured by a weighted supremum with respect to a suitable 
Lyapunov function. For an autoregressive model, we show in Section 4.1 that the resulting 
perturbation bound cannot be improved in general. As a consequence of the Wasserstein 
approach we also obtain perturbation estimates for geometrically ergodic Markov chains. 
We first adapt our Wasserstein perturbation bound to this setting. Then, as a second 
main result. Theorem 3.2, we prove a refined estimate for geometrically ergodic chains 
where the perturbation is measured by a weighted total variation distance. Our per¬ 
turbation bounds, and earlier ones in [32, 33], establish a direct connection between an 
exponential convergence property for Markov chains and their robustness to perturba¬ 
tions. In particular, fast convergence to stationarity implies insensitivity to perturbations 
in the transition probabilities. Geometric ergodicity has been studied extensively in the 
MCMC literature. Thus, our estimates can be used in combination with many existing 
convergence results for MCMC algorithms. In Section 4, we illustrate the applicability 
of both theorems by generalizing recent findings on approximate Metropolis-Hastings 
algorithms from [3] and on noisy Langevin algorithms for Gibbs random fields from [2]. 

1.1. Related literature 

We refer to [20, 21] for an overview of the classical literature on perturbation theory 
for Markov chains. However, as Stuart and Shardlow observed in [41], the classical as¬ 
sumptions on the perturbation might be too restrictive for many interesting applications. 
As a consequence, they develop a perturbation theory for geometrically ergodic Markov 
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chains [41] which requires to control perturbations of iterated transition kernels in a 
weaker sense. In our bounds for geometrically ergodic Markov chains, we have similar 
flexibility in the perturbation due to the Lyapunov-type stability condition, and require 
only a control on the errors of one-step transition kernels. 

Mitrophanov, in [33], considers uniformly ergodic Markov chains and provides the 
best estimates in those settings. In the geometrically ergodic case, there are further 
related results, see [13] and the references therein. Compared to [13], our focus is on non- 
asymptotic estimates with explicit constants, while their main focus is on qualitative 
results such as inheritance of geometric ergodicity by the perturbation. Earlier related 
results on perturbations induced by floating-point roundoff errors are shown in [7, 38]. 

Finally, let us point out that our paper is complementary to the work of Pillai and 
Smith [35] who also present Wasserstein perturbation bounds for Markov chains. When 
moving beyond the uniformly ergodic Markov chain case, an important challenge is to 
handle the issue that in many applications suprema of relevant quantities over the whole 
state space are infinite. The authors of [35] guarantee finiteness of supremum norms 
by restricting attention to subsets of the state space. Their bounds thus involve exit 
probabilities from these subsets. Our approach circumvents these issues by relying on 
Lyapunov-type stability conditions for the approximate algorithm. 


2. Wasserstein ergodicity 

Let G be a Polish space and B{G) be the corresponding Borel tr-algebra. Let d be a 
metric, possibly different from the one which makes the space Polish, which is assumed 
to be lower semi-continuous with respect to the product topology of G. Let V be the set 
of all Borel probability measures on (G, B{G)). Then, we define the Wasserstein distance 
oiv,yL&'P hy 

W{v,y)= inf ( f d(x,y) df{x,y), 

where M(y, fi) is the set of all couplings of v and /i, that is, all probability measures f 
on G X G with marginals z/ and /r. Indeed, on V the Wasserstein distance satisfies the 
properties of a metric but is not necessarily finite, see [46, Chapter 6]. For a measurable 
function /: G —>■ K we define 


ll/ll 


Lip 


sup 

x,y^G,x^y 


\f{x) - f{y)\ 

d{x,y) 


which leads to the well-known duality formula 


W(z/,/r) 


sup 

ll/llLip<l 



f{x){dn{x) - dn{x)) 


( 2 . 1 ) 


For details we refer to [45, Chapter 1.2]. By Sx we denote the probability measure con¬ 
centrated at X. Hence IF((5a;,(5y) = d{x,y) is finite for x,y € G. 
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Let P be a transition kernel on {G,B{G)) which defines a linear operator P: V ^ V 
given by 

fiP{A) = ( P(x, A) d/r(x), ^GV,AgB{G). 

Jg 

With this notation we have SxP{A) = P{x,A). Further, for a measurable function 
/: G —>■ R and /i G P we have 


f{x)d{fiP){x) 


Pf{x) dfi{x), 


with Pf{x) = Jq f{y)P{x, dy) whenever one of the integrals exist, see for example [40, 
Lemma 3.6]. Now, by 


r(P) := 


WiSxP,dyP) 
sup -f— 

x,y^G,x^y ^1^5 yj 


we define the generalized ergodicity coefficient of transition kernel P. This coefficient can 
be understood as a generalized Dobrushin ergodicity coefficient, see [8, 9]. Dobrushin 
himself called t(P) the Kantorovich norm of P, see [10, formula (14.34)]. Finally, t(P) 
also provides a lower bound of the coarse Ricci curvature of P introduced in [34]. 

Two essential properties of the ergodicity coefficient are submultiplicativity and con- 
tractivity, see [10, Proposition 14.3 and Proposition 14.4]. 


Proposition 2.1. For two transition kernels P and P on {G,B{G)) and y,,v we 
have 


t{PP) < t{P)t{P) (Submultiplicativity), 
and W{vP, pP) < t{P) W(iz, p) (Contractivity). 


As an immediate consequence of this contractivity, we obtain the following corollary. 


Corollary 2.1. Let P be a transition kernel with stationary distribution tt, i.e. ttP = tt, 
and assume for some (and hence any) xq G G it holds that d(xo, x) d7r(x) < oo. Then 


W{5xP,tt) 


< r{P). 


( 2 . 2 ) 


Proof. Because of the assumption d{xo,x) d7r(x) < oo we have that W{Sx,tt) is finite 
for any x € G. Thus, the assertion follows by Proposition 2.1 and stationarity of tt. □ 


Remark 2.1. For some special cases one also has an estimate of the form (2.2) in the 
other direction. To this end, consider the trivial metric d{x, y) = 2 ■ Ix/y with indicator 
function 


1 


x^y 


x^y 

x = y. 
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Further, let 


Iklltv — sup 
ll/lloo<l 


[ /(y)dg(y) 

Jg 


2 sup |g(A)| 
AeB{G) 


be the total variation norm of a signed measure q on G. In this setting W (/i, v) = 
ll/i — For a;,?/ £ G with x ^ y we have = d{x,y) = 2 so that 


p{P) = \ sup ¥xP-SyP\\^^. 

^ x,y^G,x^y 


(2.3) 


The “1” in the subscript of ti{P) indicates that we use the trivial metric. By applying the 
triangle inequality of the total variation norm we obtain ti{P) < sup,,,gg. \\5xP — If 
additionally tt is atom-free, i.e., 7r({y}) = 0 for all y £ G, we have \\5y — 7r||j^ = 2. Then, 
the previous consideration and (2.2) lead to 

i SUpIl^F- 7r||^^ < Tl(P) < sup \\SxP - TT\\tv ■ 

^ xeG xeG 


For the moment, let us assume that P is uniformly ergodic, that is, there exist numbers 
p £ [0,1) and G £ (0, oo) such that 


sup II4-P” - 7r||t^ < Gp”, n £ N. 
x^G 


An immediate consequence of the uniform ergodicity is that ti{P'^) < Cp"^. 


Also note that if there is an no £ N for which t{P'^° ) < 1 we have by the submultiplica- 
tivity, see Proposition 2.1, that t{P'^) converges exponentially to zero. This motivates 
to impose the following assumption which contains the idea to measure convergence of 
SxP"' to TT in terms of t{P"). 


Assumption 2.1 (Wasserstein ergodicity). For the transition kernel P there exist num¬ 
bers p £ [0,1) and C £ (0,oo) such that 


r{P-) 


sup 

X ,y^G ,x^y 


IF(P"(a;,-),P"(y,-)) 

d{x,y) 


< Cp'^, n £ N. 


(2.4) 


For any probability measure po € V, a transition kernel P with stationary distribution 
TT and pn = PqP" we have under the Wasserstein ergodicity condition that 

W{pn,TT)<Cp’^W{po,TT). 


3. Perturbation bounds 

By No = {0,1, 2,...} we denote the non-negative integers and assume that all random 
variables are defined on a common probability space (II, P, P) mapping to a Polish space 


imsart-bj ver. 2014/10/16 file: wassersteiii_stylel601.tex date: February 27, 2017 





6 


D. Rudolf and N. Schweizer 


G equipped with a lower semi-continuous metric d. Let the sequence of random variables 
(-^n)nGNo be a Markov chain with transition kernel P and initial distribution po, i.e., we 
have almost surely 

P(X„ € A I Xo,..., Xn-i) = P(x„ e A I X„_i) = A), n e N 

and po(A) = P(Alo € A) for any measurable set A C G. Assume that is another 

Markov chain with transition kernel P and initial^distribution pQ. We denote by pn the 
distribution of A„ and by p„ the distribution of Throughout the paper, (A„)„gN is 
considered to be the ideal, unperturbed Markov chain we would like to simulate while 
(X„)„gNo is the perturbed Markov chain that we actually implement. 


3.1. Wasserstein perturbation bound 

Similar as in [33, Theorem 3.1], we show quantitative bounds on the difference of p„ and 
Pn, but use the Wasserstein distance instead of total variation. Besides Assumption 2.1, 
the bounds depend on the difference of the initial distributions and on a suitably weighted 
one-step difference between P and P. 


Theorem 3.1 (Wasserstein perturbation bound). Let Assumption 2.1 be satisfied with 
the numbers G G (0, oo) and p G [0,1), i.e., t[P‘^) < Gp^. Assume that there are numbers 
S G (0,1) and L G (0,oo) and a measurable Lyapunov function M : G —>■ [l,oo) of P sueh 
that 


{PV){x) < 6V{x)+L. 


(3.1) 


Let ^ 

w(4-P,4P) 

7 = sup-- 

xeG y{x) 

withpo{V)= f^V(x) dpo(x). Then 


and 


K = max < po{V) 


1-S 


W{pn,Pn) < G (^p^W{po,Po) + (1 - P")y^) ■ 
Proof. By induction one can show that 


(3.2) 


n—1 

Pu-Pn = {po-po)P^ + J2P^(P-P)P"~'~'^ (3.3) 

2=0 


We have 


W{piP,piP)< / W{SxP,Sa;P)dpi{x)<'y V{x)dpi{x). 
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Moreover, for i > 0 we have 

J^V{x)dp,ix) = J^PMix)dpo{x) < S^poiV) + < max|po(V^), 

so that we obtain WifpiP^piP) < 7 K. By this fact we have 

W{p,PP^-^-\piPP'^-^-^) < jK ■ (3.4) 

Then, by (3.3), (3.4) and the triangle inequality of the Wasserstein distance we have 

n—1 

W{pr,,Pn) < W{poP^,PoPn + 5] W(kPP"-*-\kPP"-*-^) 

i^O 

n—1 

< W(po,Po)r(P")+7«^r(P*). 

i^O 

Finally, by (2.4) we obtain t{P^) < '"^IZp ^ 1 which allows us to complete the 

proof. □ 

Remark 3.1. The parameter k is an upper bound on pi(V). It can be interpreted as 
a measure for the stability of the perturbed Markov chain. The parameter 7 quantifies 
with a weighted supremum norm the one-step difference between P and P. The use of the 
Lyapunov function increases the flexibility of the resulting estimate, since larger values of 
V compensate larger values of the Wasserstein distance between the kernels. Notice that 
the existence of a Lyapunov function satisfying (3.1) is weaker than assuming fo-uniform 
ergodicity of P since it is not associated with a small set condition. In particular, the 
condition is satisfied for any P with the trivial choice V{x) = 1 for all x G G, see Corollary 
3.2. As we will see in Section 4, allowing for non-trivial choices of V considerably increases 
the applicability of our results. 

If P has a stationary distribution, say tt € P, as a consequence of the previous theorem, 
we obtain bounds on the difference between tt and tt. 

Corollary 3.1. Let the assumptions of Theorem 3.2 be satisfied. Assume that P has a 
stationary distribution tt gP and let W (tt, tt) be finite. Then 

W{TT,n) < (3.5) 
I — p 1 — 0 

Proof. By Theorem 3.2 we obtain with pq = tt, pq = tt, the stationarity of the distribu¬ 
tions TT, TT and by letting n —>■ 00 that 

W{TT,n) < 

1 - p 
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By the Lyapunov condition and [16, Proposition 4.24], it holds that 

= / V{x)(St{x) < ^ 

Jg 1-4 

which leads to k < L/{1 — (5) and finishes the proof. 


□ 


Remark 3.2. It may seem artificial to assume lP(7r,7f) < oo but this is needed for 
the limit argument in the proof. This condition is often satisfied a priori. For example, 
it holds if the metric is bounded, i.e., swp^ y^Qd(x,y) is finite, or, more generally, if the 
distributions tt and tt possess a first moment in the sense that there exist Xo,xo G G such 
that 

/ (i(a;o, cc) d7r(a:) < oo, / d(lEo, cr) d7f(a;) < oo. 

Jg Jg 

As pointed out in Remark 3.1, we do not need to impose condition (3.1) to obtain a 
non-trivial perturbation bound: 

Corollary 3.2. Assume that Assumption 2.1 holds with the numbers C G (0, oo) and 
p G [0,1), i.e., t(P'^) < Cp^, and let 

7 := svcpW{S^P,6^P). 

x^G 


Then 

W{pn,Pn) < C ^p”VF(po,Po) + (1 - 

Proof. The statement follows by Theorem 3.1 with V(x) = 1 and L = 1 — S. □ 

Remark 3.3. For the trivial metric d{x, y) = 2-lx^y the last corollary states essentially 
the result of [33, Theorem 3.1], where instead of the general Wasserstein distance the 
total variation distance is used. There, the bound’s dependence on C and p can be further 
improved by using the a priori bound ti(P"') < 1 in addition to uniform ergodicity. For 
another metric d such an a priori bound is in general not available. 

Remark 3.4. Table 3.1 provides a detailed comparison between our Theorem 3.1 and 
the related Wasserstein perturbation result of Pillai and Smith, [35, Lemma 3.3]. An 
important ingredient in their estimate is a set G C G which can be interpreted as 
the part of G where both Markov chains remain with high probability. When a good 
uniform upper bound on W{SxP, 6xP) for all a: € G is available, we can choose G = G 
in [35, Lemma 3.3] and V(x) = 1 in Theorem 3.1. In that case, both results essentially 
simplify to Corollary 3.2. The results become entirely different when such a bound is not 
available or too rough. In our estimate, one then needs a non-trivial Lyapunov function 
for P and a uniform upper bound on W{SxP,SxP)/V{x). To apply their estimate, one 
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needs a uniform bound on W{SxP,SxP) for all x G G. In addition, a bound on 7r(G\ G), 
Lyapunov functions and estimates of the exit probabilities from G of both Markov chains 
need to be available. Finally, while [35, Lemma 3.3] requires slightly more regularity on 
the Lyapunov function, contractivity of the unperturbed transition kernel P (with G = 1) 
is not needed on the whole state space but only on G. 


Table 1. Comparison of the Wasserstein perturbation bound of [35, Lemma 3.3] and Theorem 3.1. 
Here p, 5 G [0,1), L, Cp, C,D £ (0, oo), H: G —>■ [0, oo), H: G —>■ [1, oo) and E(x) = d{x, y)d7r(j/). 



Assumptions of 
[35, Lemma 3.3] 

Assumptions of 
Theorem 3.1 

Convergence 

property 

. W(SxP,SyP) ^ 

3G C G s.t. sup - - --- < p 

^,yeG 

t{P") < Gp^ 

Lyapunov function 

PV{x) < SV{x) + L 

PV{x) < SV{x) + L 

PV{x) < SV[x) + L 

Drift regularity 

E[V(X„+i 1 = x,X„+i ^ G)] < C 

E[V{X^+i 1 X„ = x,X„+i 0 G)] < G 

G G s.t. d{x,p) < V{x) + Cp 

— 

Perturbation error 

7 := sup W{SxP, SxP) 

W{5xP,5xP) 

7 := sup 

xeG V(x) 

Regularity of tt 

/g\g L(a:)d7r(a:) < D 

7r(G \ G) small 

— 

Conclusion: 
Upper bound of 

P'^Eix) + + 

(iG + S'^iVix) + D) + Cp) niG \ G) + 

2(1 - P[{Xj}]ll U c G])(G + + Cp) 

Gp”E{x)+ 

^max{V(a:), 


3.2. Perturbation bounds for geometrically ergodic Markov 
chains 

In this section, we derive general perturbation bounds for geometrically ergodic Markov 
chains. First, we recall some results from [17], [26] and [36] which are helpful to apply 
our Wasserstein perturbation bounds in the geometrically ergodic case. Then we present 
the new estimates: 

• Corollary 3.3 is an application of Theorem 3.1 with Wasserstein distances replaced 
by F-norms of differences between measures. 
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• In Corollary 3.4, we show that having a Lyapunov function V for P is sufficient 
for our bounds if the transition kernels P and P are sufficiently close (in a suitable 
sense). 

• In Theorem 3.2, we provide a quantitative perturbation bound which still applies 
if we can only control the total variation distance between P{x, •) and P{x, •). To 
measure the perturbation in such a weak sense is new for geometrically ergodic 
Markov chains. 

A transition kernel P with stationary distribution tt is called geometrically ergodic if 
there is a constant p € [0,1) and a measurable function C: G ^ (0, oo) such that for 
TT-a.e. X € G we have 

\\P^{xr)-7r\\,^<Gix)pG 

For (/)-irreducible and aperiodic Markov chains, it is well known that geometric ergodicity 
is equivalent to M-uniform ergodicity, see [36, Proposition 2.1]. Namely, if P is geomet¬ 
rically ergodic, then there exists a 7r-a.e. finite measurable function P: G —>■ [l,oo] with 
finite moments with respect to tt and there are constants p € [0,1) and G € (0, oo) such 
that 


lP"(x,-) - 7r||^ := sup 


lfl<v 


Thus 


f(p)(P"(x,dp) -7r(dy)) 

"P^(xr)-yrllv 


<GV(x)p", xeG,neN. 


sup 


V(x) 


< Gp'' 


(3.7) 


The following result establishes the connection between P-norms and certain Wasserstein 
distances. It is basically due to Hairer and Mattingly [17], see also [26]. 


Lemma 3.1. Assume that V is lower semi-continuous on G. For x,y G G, let us define 
the metric 

1^0 X = y. 

Then, for any p,v € V we have 


\\p-,z\\v =Wdy{p,iz), (3.8) 

where Wdy denotes the Wasserstein distance based on the metric dy- 

Lower semi-continuity of V implies lower semi-continuity of dy, which leads to the 
duality formula (2.1) by [45, Theorem 1.14]. We thus impose the standing assumption of 
lower semi-continuity of V whenever we speak of P-uniform ergodicity in the following. 
In principle, this requirement can be removed and (3.8) remains true, but we do not go 
into further detail in that direction. In applications, this is typically not restrictive since 
V is continuous anyway. 
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By similar arguments as in the proof of [26, Theorem 1.1] we observe that (3.7) implies 
a suitable upper bound on 


Ty(P) = sup 


Wdyi 6 ^P, 6 yP) 


j:,yGG,x^y x,y^G,x^y “b ^iv) 

Lemma 3.2. If (3.7) is satisfied for the transition kernel P, then tv{P"') < Cp^. 

Proof. For any positive real numbers ai,a 2 ,&i ,&2 we have the following elementary 
inequality 


\\P{x,-)-Piy,-)\\y 


ai + 02 

hi + 62 


< max 


Oi 02 

Vl'V2 


(3.9) 


By (3.9) we obtain 


ry{p-) = sup 




< sup 

x,y^G, x^y 


|P"(x, •) - 7r||y + |lP"(y, •) - 7r||y 


,yGG,x^y dv{x,y) 

< SUP \\P-iyG)-n\\v 

- .Jg I Vix) ’ Viy) 

Now, by using (3.7) we obtain the assertion. 


= sup 
x^G 


V{x) + Viy) 

P^ix,-) - 7r||y 


Vix) 


□ 


The lemmas above and Theorem 3.1 lead to the following new perturbation bound for 
geometrically ergodic Markov chains. 

Corollary 3.3. Let P be V-uniformly ergodic, i.e., there are constants p £ [0,1) and 
C £ (0, oo) such that 


||P"(a;,-)-7r||^ < Cy(x)p”, a;£G,n£N. 

We also assume that there are numbers 6 £ (0,1) and L £ (0,oo) and a measurable 
Lyapunov function V : G ^ [1, oo) of P such that 

iPV)ix) < SVix) + L. (3.10) 


Let 


7 = sup 
x£G 


Pix,-) - Fix,-) 


Vix) 

with poiV) = J^Vix) dpoix). Then 


and «; = max<po(r) 


1-S 


\\Pn 


Pnlly ^ C 



Polly + (1 



(3.11) 
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Remark 3.5. In [41, Theorem 3.1], a related perturbation bound is proven. The conver¬ 
gence property of the unperturbed transition kernel is slightly weaker than our I^-uniform 
ergodicity, but also based on a kind of Jjyapunov function. More restrictively, there it is 
assumed that the difference of P" and P" for all n > 0 can be controlled. In addition, the 
perturbation error is measured with a weight given by the same Lyapunov function as in 
the convergence property of P, but by taking a supremum over a subset of test functions. 
With our approach we can take the supremum over all test functions and obtain similar 
estimates by setting po = tt. 

The next corollary demonstrates how the Lyapunov function of P can be replaced 
by a Lyapunov function of P, provided that the distance between the transition kernels 
is sufficiently small. Notice that assuming the existence of a Lyapunov function of P in 
addition to the I^-uniform ergodicity is a definition of constants rather than an additional 
requirement, see, e.g., [5]. 

Corollary 3.4. Let P be V-uniformly ergodic, i.e., there are constants p € [0,1) and 
C € (0, oo) such that 


|[P"(a;,-)- 7 r||^, < Cy(x)p”, x€G,n£N. 

Moreover, V\ G —?► [1, oo) is a measurable Lyapunov function of P, such that 

{PV){x) < 5V{x)+L 

with constants S € (0,1) and L G (0, oo). Let 
P{x,-) -P{x,-) 


7 = sup 
x^G 


V 


V{x) 


and K = max < po{V), 


1 — (5 — 7 


(3.12) 


with poiy) = Jq V(x) dpo(x). Ifj-\-S < 1, then 

\\Pn - PnWv < C (p" \\P0-P0\\v + (1 - • (3.13) 

Proof. It suffices to show that 

{PV){x) <{S + j)V{x) + L (3.14) 

and then to apply Corollary 3.3. We have 

((P - P)V){x) < |((P - P)V){x)\ < ||p(a;,.) - P{x, < 7 V{x) 

which implies (3.14). The assertion follows by the assumption that (5 -|- 7 < 1 and an 
application of Corollary 3.3. □ 
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Remark 3.6. For discrete state spaces and under the requirement po = po, a result 
similar to the previous corollary is obtained in [21, Theorem 3, Corollary 3]. The authors 
of [21] replace our constant k by maxo<i<nPi(V). This we could do as well, see the proof 
of Theorem 3.1. 


In the perturbation bound of Corollary 3.3, the function V plays two roles. In its 
first role, V appears in the F-uniform ergodicity condition and thus is used to quantify 
convergence of P. In its second role, V appears in the constant 7 , with which we compare 
P and P, as well as in the definition of the distance between and We can interpret 
7 of Corollary 3.3 as an operator norm oi P — P. To this end, let By be the set of all 
measurable functions /: G —>■ K. with finite 


I/I 


y := sup 




^ T 7- / \ 5 

gg V{x) 


(3.15) 


which means 

Bv = {f:G^R\\f\y<^}. 

It is easily seen that {By, Hy) is a normed linear space. In the setting of Corollary 3.3, 
we have ^ ^ 

|||P-P||| := sup (P-P)/ = 7 . (3.16) 

In Corollary 3.4, the more restrictive case F = F is considered. The corresponding 
operator norm |||P — P\\\bv^Bv appears in classical perturbation theory for Markov 
chains, see [20, 21]. But as discussed in [41, p. 1126] and [13] it might be too restrictive 
to measure the perturbation with this operator norm for F = F. 

By relying, e.g., on [28, Proposition 2] we have some flexibility in the choice of F. There 
it is shown that, for r £ (0,1), F-uniform ergodicity implies F''-uniform ergodicity. This 
leads to less favorable constants in the W-uniform ergodicity of P, but can relax the 
requirements on the similarity of P and P. Namely, with a Lyapunov function F of P 
we can apply Corollary 3.3 with a F''-uniformly ergodic P and 7 = |||P — Pllls^r-s-s^■ 
Unfortunately, this approach breaks down for r = 0. To see this, notice that F’’- 
uniform ergodicity with r = 0 is just uniform ergodicity which is not implied by geometric 
ergodicity. The next theorem overcomes this limitation by separating the two roles of the 
function F in the previous perturbation boimds. Roughly, we set F = 1 in the sense that 
we measure the distances between P and P as well as between and in the total 
variation distance. At the same time, we set F = F in the sense that we assume P is 
F-uniformly ergodic with Lyapunov function F. 


Theorem 3.2. Let P be V-uniformly ergodic, i.e., there are constants p £ [0,1) and 
C £ (0,oo) such that 


||P"(a;,-)-7r||^ < GF(a;)p”, a; £ G, n £ N. 
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Moreover, V \ G ^ [IjOo) is a measurable Lyapunov function of P and P, such that 
{PV){x) < SV{x) + L, and {PV){x) < V{x) + L, 
with constants S € (0,1) and L € (0,oo). Let 

P{x,-) -P{x,-) 


7 = sup 
x^G 


V{x) 


and K = Taax<po(y), 


L 


l-,5 


(3.17) 


with pq{V) = fQV(x) dpo(x). Then, for j € (0,exp(—1)) we have 


\\Pn-Pn\\t. < Cp^ \\p,-p4v + (2C(L + l))'°*5(^-^)-7log(7-'). (3.18) 


Proof. From the proof of Theorem 3.2 we know that 

n —1 

\\Pn - Pnlltv < IKPO - Po)^’”|ltv + ~ 


in—2—1 


2=0 


, n e N. 


By Lemma 3.2, we have 

IKPO -Po)^’”lltv ^ \\ih-Po)P'^\\v < C'p” 11^3 -Polly ■ 

Fix a real number r G (0,1) and let s = 1 — r. By considering (2.3) one can see that 
Ti{P) < 1. This leads to 


p,{P-P)P 


n—i—1 


< 


< 


p,{p-p)p 


n—i—1 


P,{p-p)p 


n— 2 —1 


UP-P) UP-P) 


We also have 


UP-P) 


< / 

Jg 


S.p - S^p 


dpj(a;) < 7 / V{x)dpi{x), 
Jg 


~ r ~ WdAS,,P,s,,p) f ~ 

Pi{P-P) _< / Wdy(4-P,4-P)dpi(x) < sup- ~ - / V{x)dpi{x). 

vJg x&g V{x) Jg 

Moreover, for i > 0 we obtain 

[ V{x)dp^{x)= f PM{x) dpo{x) < 5"po{V) + ^^ 

Jg Jg (1 - o) 


and, by 


Wdy(i5x^’,4-P) = inf ^ [ [ iy{z)+ V{y))l,,^yd^{y,z) 
^eM{s^p,s^p) Jg Jg 

< PV{x) + PV{x) < (1 + S)V{x) + 2L, 
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we have 


Then 


Wd4S,P,5,P) 

+ 1)' 

xGG V[x) 


\\Pn - P«|ltv < C'p” Ibo - Polly + 2"(L + IYYk Ty{py. 


2=0 


Finally, by Lemma 3.2 we obtain 


n —1 




nS — 


2=0 


1-p" s(l-p)' 


For 7 G (0,exp(—1)), we can choose the numbers r = 1 + log( 7 ) ^ and s = log (7 
This yields 7 ’' = exp(l )7 and the proof is complete. □ 


Remark 3.7. Let tt € V he a stationary distribution of P. Notice that by the as¬ 
sumption that V is Lyapunov function of P and [16, Proposition 4.24] it follows that 
7f(F) < L/(l — S). Further, by the F-uniform ergodicity of P we also know that 7r(F) is 
finite. Thus, ^ ^ 

IItI" — ^lly < < OO- 

Now, by Theorem 3.2 we can bound || 7 r —^||j.^, with pq = ir, po = n and by letting 
n — 00 . We obtain 

lk-^f||,v < ^ exp(l)7log(7-'). (3.19) 


Remark 3.8. Let us comment on the dependence of 7 . In Section 4.3, we apply Theo¬ 
rem 3.2 combined with (3.19) in a setting where we have 7 < it'dog(iV)/A^ for a constant 
K > 1 and some parameter TV G N of the perturbed transition kernel. For £ G (0,1) and 
any TV > we have 7 < exp(—1). Then, with some simple calculations, we 

obtain for po = po and TV > the bound 


max{||p„ -Pnlltv 


lk-7r||t^} < 


3k (2C'(L-k l))2/'°gW 

1-p 


TVlog(TV)2 

TV 


Remark 3.9. In the setting of Theorem 3.2, we can also interpret 7 as an operator 
norm. Namely, 

1 ||P —p||| = sup 


{P - P)f 


= 7- 


(3.20) 


Here the subscript “1” in |/|^ indicates V{x) = 1 for all x £ G, see (3.15). For £0 > 0 
and a family of perturbations (Pe)|e|<eo let 7 = |||P — Pe\\\Bl-^By —>■ 0 for £ —>• 0. This 
condition appears in [13, Theorem 1, condition (2)] and is an assumption introduced by 
Keller and Liverani, see [22]. 
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4. Applications 

We illustrate our perturbation bounds in three different settings. We begin with studying 
an autoregressive process also considered in [13]. After this, we show quantitative per¬ 
turbation bounds for approximate versions of two prominent MCMC algorithms, namely 
the Metropolis-Hastings and stochastic Langevin algorithms. 


4.1. Autoregressive process 


Let G = R and assume that (W„)„gNo is the autoregressive model defined by 

= CnXn-l + Zn, n G N. (4-1) 

Here Xq is an R-valued random variable, a € (—1,1) and (Zn)neN is an i.i.d. sequence of 
random variables, independent of Xg. We also assume that the distribution of Zi, say p,, 
admits a first moment. It is easily seen that (A„)„gNo is a Markov chain with transition 
kernel 

Fa(x,A)= / lA(ax + y)d^(y), 

JR 

and it is well known that there exists a stationary distribution, say tTo,, of Pq- 

Now, let the transition kernel Rs with a € (—1,1) be an approximation of Pa- For 
x,y € G, let us consider the metric which is given by the absolute difference, i.e., d{x, y) = 
\x — y\. We assume that |a — a| is small and study the Wasserstein distance, based on d, 
of poPa and PoP~ with two probability measures po and po on (R, H(R)). 

We intend to apply Theorem 3.1. Notice that for H: R —>■ [l,oo) with V{x) = 1 -I- |a:| 
we have 

PaV{x) < jaEj H(x) -b 1 — |5| -I- E \Zi\ 

which guarantees that condition (3.1) is satisfied with <5 = |5| and L = 1 — |5| -b E \Zi\. 
Furthermore 

W{5xPa,5yPa) < / \ax-z-ay + z\ Ap{z) <\a\\x - y\ = \a\d{x,y), 
leads to T{Pa) < ja]”. Similarly, one obtains 


W{SxPa,5xPs) ^ / \ax — z — ax + z\ diJ,{z) < \x\\a — a\ 

Jr 

which implies that 


We set 


Wi6xPa,SxPs) ^ 

sup-^-< \a — a\. 

a:GB V {x) 


K = l-bmax| [ |a;|dpo(a^), 

Ur l-|a|J 


imsart-bj ver. 2014/10/16 file: wasserstein_stylel601.tex date: February 27, 2017 




Perturbation theory for Markov chains via Wasserstein distance 


17 


and Pa,n = PoPaj P5,n = • Then, inequality (3.2) of Theorem 3.1 gives 

W{pa,n,ps,n) < \a\'^ W{po,po) + |a - 5| 

1 - |a| 

and for pq = pq we have 


W{pa,n,PS,n) < |a “ a| 


1 - |a| 


(4.2) 


(4.3) 


From the previous two inequalities one can see that if a is sufficiently close to a, then 
the distance of the distribution p^^n and ps,n is small. Let us emphasize here that we 
provide an explicit estimate rather than an asymptotic statement. 

Note that by [16, Proposition 4.24] and the fact that Ppg{x) < \l3\g{x) +E|Zi| with 
g{x) = jxj and/3 G {a, 5} we obtain |a;| d 7 r^(x) < oo, which leads to a finite lF( 7 ra, tt^). 
As a consequence we obtain for the stationary distributions of Pa and Pa by estimate 
(3.5) that 


W{Tra,TTs) < |a - a| 


l-|5|+E|Zi| 

(l-|a|)(l-|5|)- 


(4.4) 


The dependence on \a — 5| in the previous inequality cannot be improved in general. 
To see this, let us assume that Xo^a and X(j a are real-valued random variables with 
distribution tTo, and tt^, respectively. Then, because of the stationarity we have that 
Xi^a = ctXo^a + Zi and A'l 5 = aX^ a + ^1 are also distributed according to tTq, and TTg, 
respectively. Thus 


EZi 




1 -a 




EZi 

1 — 5 


Now, for 5 : R —^ M with g{x) = x, we have ||g||Lip < 1 and thus 


bF( 7 r«, 7 r 5 ) = sup 

ll/llLip<l 


> 


IG 

= la — 5 


/(x)(d7r„(x) - d7r5(a;)) 

= |EXo,a — 


x{dTra{x) - d7rs(a::)) 

|EZi| 


II-a| 11-51 


Hence, whenever MZi ^ 0 we have a non-trivial lower bound with the same dependence 
on |a — 5| as in the upper bound of (4.4). This fact shows that we cannot improve the 
upper bound. 

Let us now discuss the application of Corollary 3.4 and Theorem 3.2. Under the 
additional assumption that /i, the distribution of Zi, has a Lebe^ue density h, it is 
shown in [15, Section 4] that the autoregressive model (4.1) is also U-uniformly ergodic. 
Precisely, there is a constant C > 1 such that 

\\P:{x,-)-7ral^<C\arV{x). 
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Moreover, from [13, Example 1] we know that 

l|Pa(x, •) - Ps{x,-)\\y 


sup • 


V{x) 


does not go to 0 when a f a. Hence, Corollary 3.4 cannot quantify for small \a — 
a I whether the nth step distributions are close to each other. However, also in [13, 
Example 1] it is proven that 


sup 


\Paix, •) - Ps(3 

Vix) 


—?> 0 if a —>• a. 


This indicates that Theorem 3.2 is applicable. By assuming in addition that h is weakly 
unimodal^ and bounded from above by /imax, we can quantify the result. Namely, 


sup ■ 


\\Pa{x,-) - Ps{3 
V{x) 


f •)lltv _ „„„ IIm(- - 

~ 1 I I I 

aiGiR 1 + FI 

\ h{z — ax) — h{z — ax) \ dz 


= sup 

aiGM. 


i + bl 


< 2\a — a\hr. 


To see the final estimate, define F{a) = /jj \h{z) — h{z — a)|d 2 ; for a € M. By unimodality, 
there exists for any fixed a > 0 a constant c such that 

n pc nOO 

/ \h{z) — h{z — a)\dz = / h{z) — h{z — a)Az + / h{z — a) — h{z)dz. 

J —oo J c 

The first summand on the right hand side we can bound by 

/ h{z)dz — / h{z — a)dz = / h{z)dz < a 

-OO J —OO J c—a 

and similarly for the second summand. Using that F{a) = F{—a), we obtain F{a) < 
2|a| hmax- Finally, by substitution we can write 

J^\h{z - ax) - h{z - 5x)\dz F{a) ^ 

sup- , , I I - = \a- a\ sup ——j-^ < 2|a - a|ftmax- 

xeR 1 + fI a>0 O'+ ~ Ot] 


For simplicity set po = po and assume that /imax < 1 as well as |a — a| G (0, exp(—1)/2). 
Then, Theorem 3.2 implies 

max{||pa.n -P5.n|ltv> Ika “ TTsjl^^} < (2C'(E|Zi| + 2)) |a - ajlogdo; - 5|"^) 

which seems to be new. 

^The function h:M. ^ [0, oo) is called weakly wnimodal if there exists s G M such that h(x) is 
nondecreasing for x G (—oo,s) and nonincreasing for x £ (s,oo). 
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We apply our perturbation results to the approximate (or noisy) Metropolis-Hastings 
algorithms analyzed in [2, 3, 4, 23, 29, 35]. We assume either that the unperturbed tran¬ 
sition kernel of the Metropolis-Hastings algorithm satisfies the Wasserstein ergodicity 
condition stated in Assumption 2.1 or is geometrically ergodic. In particular, we do not 
assume that the transition kernel is uniformly ergodic. Let tt be a probability distribu¬ 
tion on {G,B{G)) and assume that we are interested in sampling realizations from this 
distribution. Let Q be a transition kernel which serves as the proposal for the Metropolis- 
Hastings algorithm. From [44, Proposition 1] we know that there exists a set S C G x G 
such that we can define the “acceptance ratio” for {x,y) G G x G as 


r{x,y) 


Tr{dx)Q(x,dy) 

0 


{x,y) G S 
otherwise. 


(4.5) 


Then, let the acceptance probability be a{x, y) = min{l, r{x, y)}. With this notation the 
Metropolis-Hastings algorithm defines a transition kernel 


Pa{x,dy) = Q{x,dy)a{x,y) -I- dx{dy) Sa{x), 


(4.6) 


with 

Sa{x) = l- a{x,y)Q{x,dy). 

JG 

We provide a step of a Markov chain (A„)„gNo with transition kernel Pa in algorithmic 
form. 


Algorithm 4.1. A single transition from to X^+i of the Metropolis-Hastings algo¬ 
rithm works as follows: 

1. Draw a sample Y ~ Q{Xm-) and U ~ [/m/[0,1] independently, call the result y 
and u; 

2. Set r := r{Xn,y), with the ratio r(-,-) defined in (4.5); 

3. If u < r, then accept the proposal, and set := y, else reject the proposal and 

set .= X^i- 

Now, suppose we are unable to evaluate r{x,y), so that we are forced to work with 
an approximation of a{x,y). The key idea behind approximate Metropolis-Hastings al¬ 
gorithms is to replace r(x, y) by a non-negative random variable R with distribution, say 
yx,y,u, depending on x,y € G and u G [0, Ij. For concrete choices of the random variable 
R we refer to [2, 3, 4, 23]. We present a step of the corresponding Markov chain (A„)„gN 
in algorithmic form. 

Algorithm 4.2. A single transition from A„ to Xn+i works as follows: 

1. Draw a sample Y ^ Q{Xm-) and U ^ Unij{0,l] independently, call the result y 
and u; 
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2. Draw a sample i? ~ ^ call the result r; 

3. If u < r, then accept the proposal, and set Xn+i := y, else reject the proposal and 

set := Xji. 


The algorithm has acceptance probability 

^OO 

a(a;, 2 /) = ]El[o,min{l._R}](C^) = / / 1[0, min{l,?}](«) 

JO Jo 

and the transition kernel of such a Markov chain is still of the form (4.6) with a{x,y) 
substituted by a{x, y), i.e., it is given by Pa- The following results hold in the slightly more 
general case where a{x,y) is any approximation of the acceptance probability a(x,y). 

The next lemma provides an estimate for the Wasserstein distance between transition 
kernels of the form (4.6) in terms of the acceptance probabilities. 

Lemma 4.1. Let Q be a transition kernel on (G,B{G)) and let a: G x G ^ [0, 1] and 
a: Gx G —> [0,1] be measurable functions. By Pa and Pa we denote the transition kernels 
of the form (4.6) with acceptance probabilities a and a. Then, for all x G G, we have 


W{5,^Pa,5,^Pa) < / d{x,y)£{x,y)Q{x,dy) 

Jg 

with £{x,y) = \a{x,y) - a{x,y)\. 

Proof. By the use of the dual representation of the Wasserstein distance it follows that 


W{Sj;Pa,Sj;Ps) = sup 


fiy) iPaix,dy) - Ps{x,dy)) 


= sup 

ll/llLip<l 


{f{y) - f{x)){a{x,y) - a{x,y))Q{x,dy) 


< / d{x,y)£{x,y)Q{x,dy). 
Jg 


□ 

By the previous lemma and Theorem 3.1, we obtain the following Wasserstein pertur¬ 
bation bound for the approximate Metropolis-Hastings algorithm. 

Corollary 4.1. Let Q be a transition kernel on {G,B{G)) and let a: GxG —>■ [0,1] and 
d: Gx G —> [0,1] be measurable functions. By Pa and Pa we denote the transition kernels 
of the form (4.6) with acceptance probabilities a and d. Let the following conditions be 
satisfied: 

• Assumption 2.1 holds for the transition kernel Pa, i.e., T{Pa) < Gp^ for p G [0,1) 
and G € (0, oo). 
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• There are numbers 6 € (0,1), L G (0,oo) and a measurable Lyapunov function 
F : G —>• [1, oo) of Pa, i.e., 


{PaV){x) < 5V{x) + L. 

• Let £{x,y) = \a{x,y) —a{x,y)\ and assume that 

Jcd{x,y)£{x,y)Q{x,dy) 


7 = sup 

xGG 


V{x) 


< oo. 


Then, for any pq and finite poiV) = V{x)dpo{x) we have 


WipoPf:,PoPS) < 

where k = max|po(l^)) 


j kC{1 — p”) 

1 - p 


(4.7) 


(4.8) 


Let us point out several aspects of condition (4.7). Recall that (4.7) is always satisfied 
with V(x) = 1 for all x G G. However, in this case it seems more difficult to control 7 . 
If some additional knowledge in form of a Lyapunov function H: G —> [1, 00 ) of Pa, i.e., 
PaV{x) < 6V{x) + L for some 6 G (0,1) and L G (0, 00 ), is available, then a non-trivial 
candidate for V is V. For sufficiently small 

Sy = sup / (+ 1 ] £{z, y)Q(z, dy) 

z&GJG\yG) ) 

this is indeed true. Namely, we have 

\{Pa - Ps)V{x)\ < [ V{y)£{x,y)Q{x,dy) + V{x) [ £{x,y)Q{x,dy) <V{x)6v. 

JG JG 

Then, PsV{x) < (d + dy)V{x) + L and whenever 5 + dy < 1 it is clear that condition 
(4.7) is verified. 

To highlight the usefulness of a non-trivial Lyapunov function, we consider the fol¬ 
lowing scenario which is related to a local perturbation of an independent Metropolis- 
Hastings algorithm. 


Example 4.1. Let us assume that for Pa Assumption 2.1, as formulated in Corol¬ 
lary 4 . 1 , is satisfied. For some probability measure pt on {G,B{G)) define Q(x, ■) = p and 
Po = Po = P- For G C G let 

d{x, y) = min{l, a{x, y) + lg(a:)}. 

Hence, for x G G the transition kernel Ps{x, •) accepts any proposed state and for x ^ G 
we have Pa{x,-) = Pa{x,-). It is easily seen that £(x,y) < lg(x). For arbitrary i? > 0 
and r G (0,1) set V{x) = 1-1- R1q{x) and note that 

PaV{x) < rV{x) + 1 — r -\- RPs{x, G) < rV(x) -I- 1 — r -|- Rp{G). 
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The last inequality of the previous formula follows by distinguishing the cases x G G and 
X ^ G. Define D{G) = sup^^g d{x,y)pL{dy) and observe 


1 — r 

Then, Corollary 4-1 leads to 


and 


7 < 


D{G) 

1 + R' 


w{poPf:,poP~) < 



RhiG) \ D{G) 
1 — r j 1 + R 


for arbitrary R G (0,oo) and r G (0,1). Under the assumption that D{G) is finite and 
letting R ^ oo as well as r 0 we obtain 


W{poP:,PoPS) < 


Gn{G)D{G) 

1- p 


which tells us that basically p{G) measures the difference of the distributions. A small 
perturbation set G with respect to p, thus implies a small bias. In contrast, with the trivial 
Lyapunov function V = 1, and if there is {x,y) G G x G such that a{x, y) = 0, we only 
obtain 

7 K = D{G) > inf f d{x,y)p{dy). 

x£GJq 

The resulting upper bound on W(jioPff,pQP~) will typically be bounded away from zero 
regardless of the set G. 


Remark 4.1. The constant 7 essentially depends on the distance d{x, y) and the differ¬ 
ence of the acceptance probabilities £(x,y). By applying the Cauchy-Schwarz inequality 
to the numerator of 7 , we can separate the two parts, i.e., 


d[x,y)£{x,y)Q{x,dy) < [ / d{x,yf Q{x,dy) ■ / £{x,y)'^ Q{x,dy) 


1/2 


If both integrals remain finite we see that an appropriate control of £(x,y) suffices for 
making the constant 7 small. 


Remark 4.2. By using a Hoeffding-type bound, in Bardenet et al. [3, Lemma 3.1.] it 
is shown that for their version of the approximate Metropolis-Bastings algorithm with 
adaptive subsampling the approximation error £{x,y) is bounded uniformly in x and y 
by a constant s > 0. Moreover, s can be chosen arbitrarily small for the implementation 
of the algorithm. 


Now we consider the case where the unperturbed transition kernel Pa is geometrically 
ergodic. Motivated by Remark 4.2, we also assume that £(x,y) < s for a sufficiently 
small number s > 0. The following corollary generalizes a main result of Bardenet et al. 
[3, Proposition 3.2] to the geometrically ergodic case. 
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Corollary 4.2. Let Q be a transition kernel on (G, B{G)) and let a: Gx G —>■ [0,1] and 
5: Gx G —>■ [0,1] be measurable functions. By Pa and Pa we denote the transition kernels 
of the form (4.6) with acceptance probabilities a and a. Let the following conditions be 
satisfied: 

• The unperturbed transition kernel Pa is V-uniformly ergodic, that is, 

\\P:{x,-)-7r\\y<CV{x)p^, xGG,neN 

for numbers p G [0,1), G G (0,oo) and a measurable function V: G ^ [l,oo). 
Moreover, V is a Lyapunov function of Pa, i.e., 

(PaV){x) < 6V{x) + L, (4.9) 

for numbers S G (0,1) and L G (0,c»). 

• A uniform bound s > 0 on the difference of the acceptance probabilities is given, 
that is, for all x,y G G, we have 

£{x,y) = \a{x,y) - a{x,y)\ < s. 


The constant A satisfies 


A = 1 + sup / 

xeG JG 


vjy) 

V{x) 


Q{x,dy) < oo. 


If s < {1 — S)/X, then, for any po G V with finite k = max|po(l^), i_s-\e } have 

AskG( 1 — p”) 


wpop: - PoPswv < 


1-p 


Proof. We consider the metric dy, defined in Lemma 3.1, set F = lA and use £{x, y) < s 
so that it is easily seen that the constant 7 from Corollary 4.1 satisfies 7 < sA. From 
the proof of Corollary 3.4, we know that 14 is a Lyapunov function of Pa provided that 
7 + i5 < 1. Thus, we have 


P?iV{x) < (^ + Xs)V{x) + L. (4.10) 

Now if s < (1 — (5)/A, then d + As < 1 and the assertion follows from Corollary 4.1 by 
writing the Wasserstein distances in terms of Wnorms as in Section 3.2. □ 

Remark 4.3. Without V(x) in the denominator, i.e., if we had relied on Corollary 3.2 
instead of Theorem 3.1, the constant A would often be infinite. Consider the following toy 
example: Let tt be the exponential distribution with density exp(—x) on G = [0,oo) and 
assume that Q{x, dy) is a uniform proposal with support [x—1, x+1]. With V(x) = exp(x) 
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it is well known that the Metropolis-Hastings algorithm is ^-uniformly ergodic, see [30] 
or [37, Example 4]. In this example 

rx+l 

A < 1 + sup / exp(?/ — x)dy < 1 + exp(l) 

xG[0,oo) Jx — 1 

whereas exp(?/)d?/ is unbounded in x. Notice that A only depends on the unperturbed 
Markov chain so that a bound on A can be combined with any approximation. 


Remark 4.4. Let Pg and Pa be (/)-irreducible and aperiodic. Then, one can prove 
under the assumptions of Corollary 4.2 that Ps is E-uniformly ergodic if s is sufficiently 
small. To see this, note that by [31, Theorem 16.0.1] the E-uniform ergodicity of Pa 
implies that Pa satisfies their drift condition (V4). By the arguments stated in the proof 
of Corollary 3.4, one obtains that P 5 also satisfies (V4) for sufficiently small s and this 
implies E-uniform ergodicity. In this case, clearly Pa possesses a stationary distribution, 
say TT, and 


Ik-T^lly < 


XsC 


L 

1 — 5 — \s 


The previous inequality follows by (3.5) and the fact that 


Ik — 7r||y < + '^(^) < OO- 

Here the hniteness of 7 r(E) follows by the E-uniform ergodicity of P and ^(E) < L/{1 — 
6 — As) follows by (4.10) and [16, Proposition 4.24]. 


4.3. Noisy Langevin algorithm for Gibbs random fields 


An alternative to the Metropolis-Hastings algorithm is the Langevin algorithm, see [39] . 
Unfortunately, in its implementation one needs the gradient of the density of the target 
distribution. To overcome this problem, different approximate Langevin algorithms have 
been proposed and studied, see [1, 2, 43, 47]. 

This section is mainly based on Alquier et al. [2, Section 3.4] where a noisy Langevin 
algorithm for Gibbs random fields is considered. We provide a quantitative version of 
[2, Theorem 3.2]. The setting is as follows. Let E be a finite set and with M £ N let 
y = {yi ,..., 2 /m} G be an observed data set on nodes { 1 ,..., Mj of a certain graph. 
The likelihood of y with parameter 0 £ K is defined by 


^{y\0) 


exp( 6 >s( 2 /)) 
Hy^yM exp(ds(y))’ 


where s: R is a given statistic. The density of the posterior distribution with 

respect to the Lebesgue measure on (R, H(R)) given the data y £ y^ is determined by 


-Ky^e) := 7r(6» I y) oc ^{y \ 9)p{e) 
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where the prior density p{9) is the Lebesgue density of the normal distribution Af{0, a^) 
with CTp > 0 . 

We consider the Langevin algorithm, a first order Euler discretization of the SDE of 
the Langevin diffusion, see [39]. It is given by with 

2 

Xn = Xn-l + —V logTTy{Xn-l) + Zn, UGN. (4-11) 

Here Xq is a real-valued random variable and (Zn)neN is an i-i-d. sequence of random 
variables, independent of Xq, with ~ for a parameter cr > 0 which can be 

interpreted as the step size in the discretization of the diffusion. It is easily seen that 
(-^n)nGNo is a Markov chain with transition kernel 

P^{9,A)= A (^9+^yiogny{9)+z'^M{0,a^){dz), H G H(R). 

In general tt^ is not a stationary distribution of P^, but there exists a stationary dis¬ 
tribution (see Proposition 4.1 below), say tTo-, which is close to Wy depending on a. Let 
= SyeyM exp( 0 s(?/)) then, by the definition of iTy we have 

log TTy{9) = 9 s{y) - log z{9) + \ogp{9) -\og(^J i{y \ z)p{z)dz'^ , 
z'(9) 

Vlog 7 ry( 6 ») = s{y) - +X\ogp{9) 

f X EzeyM s( 2 ;)exp( 6 »s(z)) 9 

EzGyMexp( 0 s(z)) 

= s(2/) -IE^(.|e)s(y) - -4, 

where E is a random variable on distributed according the likelihood distribution de¬ 
termined by £{■ I 9). We do not have access to the exact value of the mean E|(.|y)s(E) since 
in general we do not know the normalizing constant of the likelihood. We assume that we 
can use a Monte Carlo estimate. Eor iV G N let (Ei)i<i<^ be an i.i.d. sequence of random 
variables with Yi ~ £{■ \ 9) independent of from (4.11). Then, E^i 'S(E) is 

an approximation of Ef(.|y)s(E) which leads to an estimate of V log 7 ry( 0 ) given by 

V^log 7 ry( 6 ») := s{y) - 

We substitute Vlog 7 ry( 0 ) by V'^log 7 ry( 0 ) in (4.11) and obtain a sequence of random 
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variables defined by 


Xn — Xn-1 + -;rV log7ry(X„_i) + Z, 


N 


“ ( ^ ^ ~ ~lv^ 1 + 


The sequence (X„)„gNp is again a Markov chain with transition kernel 

p,Me.A} = l Y, + 

“(y!. P'^ \ i=i / 


xUlAS\y')mo,<jn{dz) 



for 0 G R and A G B{M.). Let us state a transition of this noisy Langevin Markov chain 
according to Pa-,N in algorithmic form. 


Algorithm 4.3. A single transition from Xn to Xn+i works as follows: 

1. Draw an i.i.d. sequence (Ti)i<i<Ar with ~ £(• | A„), call the result {y[,... 

2. Calculate ^ ^ 

V^log7ry(A„) := s{y) - 4 XI 

i=i P 

3. Draw Zn ~ A/'(0,(t^), independent from step 1., call the result Zn- Set 

2 

Xn+l = Xn + log TTy(Xn) + Zn- 

From [2, Lemma 3] and by applying arguments of [39], we obtain the following facts 
about the noisy Langevin algorithm. 


Proposition 4.1. Let ||s||qo = sup 2 g 3 ;M |s(z)| be finite with ||s||j^ > 0, Zet fo: R —>■ 
[1, oo) be given by V (0) = 1 + |0| and assume that cA' < 4tTp. Then 

1. the function V is a Lyapunov function for Pa- and Pa,N ■ We have 

PaV{9) < SV{0) + Llj{0), Pa,NV{0) < SV{0) + Lli{9) (4.12) 


with S = l-£^,L = a + a^ ||s||^ 



and the interval 


I = 


9 G 


|0|<l + 4a2||s|U 



2. there are distributions tTo- and TTa^N on (R, S(R)) which are stationary with respect 
to Pa and Pa,N, respectively. 
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3. the transition kernels and Pa,N o,re V-uniformly ergodic. 

4- /or TV > 4max{||s||^(T^, we have 

sup||P^(6l,-)-P,T,7v(6',-)lltv ^6max{||s||ooC^^||s||;;„V“^} (4.13) 

fl(=R -‘V 


Proof. We use the same arguments as in [39, Section 3.1]. One can easily see that the 
Markov chains (Xn)„gNo a-nd (-^n)riGNo are irreducible with respect to the Lebesgue 
measure and weak Feller. Thus, all compact sets are petite, see [31, Proposition 6.2.8]. 
Hence, for the existence of stationary distributions, say tto- and tto-.tv, [31, Theorem 12.3.3], 
as well as for the H-uniform ergodicity [31, Theorem 16.0.1] it is enough to show that V 
satishes (4.12). With Z ^ A/'(0,(T^), we have 


P(tV{9) < ^1 — Ny) “ ]E^(-|6»)'S(d^)| + E \Z\ 


By the fact that 


E 


1 ^ 

2=1 


Xr.=e 


<2||s||. 


we obtain with the same arguments that 


Pa,NV{0)<5v{e) + L-ii{e). 


Thus, the assertions from 1. to 3. are proven. The statement of 4. is a consequence of [2, 
Lemma 3]. There it is shown that for TV > 4||s||^cr^ it holds that 


sup||P„(6»,-) 


^<T,7v(d,-)lltv ^ 


( log(iV) \ 
VdTVljsIP^a^; 


. 4V^||s||oog^ 

TV 


By using exp(0) — 1 < dexp(0) and TV > 4 we further estimate the right-hand side by 


/ KN,s,a 4v^||s||ooCr^\ log (TV) 
log(5) )' TV 


with 


KN,s,cr = exp 


/ log(TV) \ 

VdTVljsll^a^; ■ 


Since log(TV)-TV“^/^ < 2, we have the bound KN.s^a < exp(l) provided that 4TV^/^||s||^cr"^ 
2 which follows from TV > The assertion of (4.13) follows now by a simple 

calculation. □ 
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By using the facts collected in the previous proposition, we can apply the perturba¬ 
tion bound of Theorem 3.2 and obtain a quantitative perturbation bound for the noisy 
Langevin algorithm. 


Corollary 4.3. Let po be a probability measure on (R, S(R)) and set pn = poPff as 
well as Pn,N = PoPaN- Suppose that cr^ < 4 ( 7 ^. Then, there are numbers p € [0,1) and 
C € (0, oo), independent ofn,N, determining 




with¥,pg\X\ = /g |0| dpo (^*)7 so that for N > 90m.ax{\\ s\\l^a'^,\\s\\^ cr ®} we have 

. 2 /log(Ar) log(iV)2 


max {||p„ - p„,7v|ltv : h<T - TT^.wllt^} <R-{2C{a + cr^ ||s||^ + 3 ))^ 


N 


Proof. We have by Proposition 4.1 that is P-uniformly ergodic with V{6) = l-\- 1^|, 
i.e., there are numbers p G [0,1) and C € (0, oo) such that 


sup 


\\p^{e,-)-TT^\y 

v{e) 


< Cp^. 


Now, by combining Theorem 3.2 and Remark 3.8 with the results from Proposition 4.1 
we obtain the result. □ 


Remark 4.5. We want to point out that the assumptions imposed are the same as in 
[2, Theorem 3.2], but instead of the asymptotic result we provide an explicit estimate. 
The numbers p £ [0,1) and C £ (0, oo) are not stated in terms of the model parameters. 
In principle, these values can be derived from the drift condition (4.12) through [5, 
Theorem 1.1]. 
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